Skip to content

Add NaN and NULL count to MIN/MAX stats#3139

Open
markovskipetar wants to merge 4 commits into
masterfrom
add_nan_and_nat_counter_to_col_stats
Open

Add NaN and NULL count to MIN/MAX stats#3139
markovskipetar wants to merge 4 commits into
masterfrom
add_nan_and_nat_counter_to_col_stats

Conversation

@markovskipetar

@markovskipetar markovskipetar commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Add NaN and null counts to MINMAX column stats

Summary

Extends the MINMAX column statistic to also record per-segment counts of NaN (floating point) and null values (NaT timestamps, plus sparse-map gaps such as those from Arrow validity bitmaps), alongside the existing min/max values.

Changes

  • Proto: Added NAN_COUNT_V1 = 3 and NULL_COUNT_V1 = 4 entries to the ColumnStatsType enum in column_stats.proto.
  • Aggregation (unsorted_aggregation.{hpp,cpp}): MinMaxAggregatorData now tracks nan_count_ (floating-point NaNs) and null_count_ (NaT cells plus sparse-map gaps). For sparse columns, the gap count last_row()+1 - row_count() is added to null_count_ so validity-bitmap nulls that the dense iterator never visits are still counted. MinMaxAggregator takes two additional output column names and finalize produces 4 output columns (MIN, MAX, NAN_COUNT, NULL_COUNT) instead of 2.
  • Pipeline (column_stats.cpp): MINMAX external stat now maps to the four internal stat types; v1_NAN_COUNT / v1_NULL_COUNT operator strings and segment column naming wired through.
  • Tests: Added test_column_stats_nan_and_null_counts covering multi-segment symbols with mixed NaN, NaT, and valid values across floating-point and timestamp columns. Widened the shared assert_stats_equal helper to subselect the received frame to the columns the test cares about (so existing min/max-only tests don't have to spell out all-zero count columns), and updated test_column_stats_header_metadata for the 4-entries-per-column header layout.

Backwards compatibility

The new fields are written under new proto enum values (NAN_COUNT_V1=3, NULL_COUNT_V1=4). Existing MINMAX stats written by older clients (only MIN_V1 + MAX_V1) remain readable. Stats written by this version cannot be read by older clients that don't recognise the new enum values or that expect exactly 2 output columns from MinMaxAggregatorData::finalize.

@github-actions

Copy link
Copy Markdown

Label error. Requires exactly 1 of: patch, minor, major. Found:

pl.Series("v1_NAT_COUNT(ts_col)", [0, 1, 2, 1], dtype=pl.UInt64),
)

column_stats = lib.read_column_stats(sym)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test only covers writing-then-reading with the same (new) client. Please add (or update an existing test) so that the all-NaN / all-NaT segments verify the new v1_NAN_COUNT / v1_NAT_COUNT values directly (e.g. extend test_column_stats_only_nat_values), and add a test for backwards compatibility — column stats written before this change (where only v1_MIN/v1_MAX columns exist) must still be readable, droppable, and mergeable by the new code path.

return {ColumnStatTypeInternal::MIN_V1, ColumnStatTypeInternal::MAX_V1};
return {ColumnStatTypeInternal::MIN_V1,
ColumnStatTypeInternal::MAX_V1,
ColumnStatTypeInternal::NAN_COUNT_V1,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

external_to_internal(MINMAX) now unconditionally returns 4 internal stat types. drop() calls this to construct the list of column names to remove (v1_MIN, v1_MAX, v1_NAN_COUNT, v1_NAT_COUNT). For column stats segments that were written by an older client (only v1_MIN and v1_MAX columns exist), dropping will produce names for columns that aren't in the segment.

Please verify what the downstream consumer of dropped_names does when asked to drop a non-existent column — if it raises, this is a forward-compatibility break that needs handling; if it silently ignores, please add a test that creates column stats with the old format and then drops them with the new client.

seg.add_column(scalar_field(min_col->type().data_type(), output_column_names[0].value), min_col);
seg.add_column(scalar_field(max_col->type().data_type(), output_column_names[1].value), max_col);
seg.add_column(scalar_field(DataType::UINT64, output_column_names[2].value), nan_count_col);
seg.add_column(scalar_field(DataType::UINT64, output_column_names[3].value), nat_count_col);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalize now unconditionally writes 4 columns (min, max, nan_count, nat_count) and 4 header entries for the MINMAX stat. This changes the on-disk shape of every MINMAX column-stats segment.

Two concerns that should be addressed before merge:

  1. Backwards-compatibility test missing. There is no test that creates column stats with a pinned older ArcticDB version and then reads/drops/recreates them with this new client. Old segments only contain 2 stat columns; merge_column_stats_segments and ColumnStats::ColumnStats(header, tsd) should be exercised against that shape.
  2. Old client reading new data. Although the proto comment relies on the UNKNOWN = 0 fallback in proto3 for forward-compat, the new client also adds two extra real columns to the segment. Please confirm that an older release reading a segment with the extra v1_NAN_COUNT/v1_NAT_COUNT fields doesn't fail descriptor/field-count assertions, and document the result in the PR description.

@claude

claude Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

API & Compatibility

  • On-disk format change is called out in the PR description (Backwards compatibility section), and the proto enum is additive.
  • ⚠️ PR description prose is now stale after the latest commit. The enum identifiers (NAN_COUNT_V1=3 / NULL_COUNT_V1=4) still exist, but the Summary and Aggregation sections still say null_count_ tracks "NaT cells plus sparse-map gaps". The latest commit reclassifies in-band NaT to nan_count_ (it is now treated as an in-band sentinel like float NaN); null_count_ is now reserved for sparse-map gaps only. Update the description so it matches the merged semantics.
  • Forward/backward compatibility with old data still not exercised. external_to_internal(MINMAX) returns 4 internal types and MinMaxAggregatorData::finalize unconditionally produces 4 columns; drop() builds removal names v1_MIN/v1_MAX/v1_NAN_COUNT/v1_NULL_COUNT. Behaviour against column-stats segments written by previous releases (only v1_MIN/v1_MAX) still needs a test (read, drop, merge). Not addressed by the latest commits.

Behaviour change (latest commit)

  • NaN/NaT classification is now consistent across types. In-band NaN (float) and in-band NaT (time) both increment nan_count_ and are excluded from min/max; only sparse-map gaps (rows the dense iterator never visits) increment null_count_. This removes the earlier inconsistency where float NaN and time NaT landed in different counters. The change is self-consistent and the comments accurately describe it.

Testing

  • Sparse-map gap counting is tested. test_column_stats_null_count_sparse_floats writes with sparsify_floats=True and asserts v1_NULL_COUNT == 3 / v1_NAN_COUNT == 0, exercising the is_sparse() path including the trailing-gap edge case (last_row()+1 - row_count()).
  • Count columns are validated directly and updated for the new semantics. test_column_stats_nan_count_single_segment and test_column_stats_nan_and_null_counts now assert in-band NaT to v1_NAN_COUNT (was v1_NULL_COUNT), matching the reclassification.
  • No backwards-compatibility test that reads/drops/merges pre-existing 2-column (v1_MIN/v1_MAX only) column-stats segments through the new 4-column code path.
  • ⚠️ test_column_stats_only_nat_values / test_column_stats_only_nan_values still do not assert the new count columns (the assert_stats_equal subselect silently skips them). Lower priority now that dedicated tests cover the count semantics, but the all-NaN/all-NaT degenerate cases remain the natural place to pin them.

PR Title and Description

  • ⚠️ Description needs a refresh for the in-band-NaT to nan_count_ reclassification (see API & Compatibility above).
  • No label set. This is user-visible and needs release notes; add enhancement (or similar) — no-release-notes is not appropriate.

Documentation

  • No update to user-facing column-stats docs describing the new v1_NAN_COUNT / v1_NULL_COUNT columns and the NaN-vs-NULL semantic split. Note the split is now: in-band NaN (float) and in-band NaT (time) to v1_NAN_COUNT; sparse/validity-bitmap gaps to v1_NULL_COUNT. Not addressed by the latest commits.

@markovskipetar markovskipetar changed the title Add NaN and NaT count to MIN/MAX stats Add NaN and NULL count to MIN/MAX stats May 29, 2026
Comment on lines +28 to +36
// Sparse-map gaps are real nulls (e.g. from Arrow validity bitmaps) that the dense
// for_each below never visits. Count them from metadata so they reach null_count_.
if (input_column.column_->is_sparse()) {
const auto sparse_gap_count =
input_column.column_->last_row() + 1 - input_column.column_->row_count();
if (sparse_gap_count > 0) {
null_count_ += static_cast<uint64_t>(sparse_gap_count);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sparse-map gap counting is a new behavioural code path, but no test in this PR exercises it — test_column_stats_nan_and_null_counts only uses dense pandas frames with in-band NaN/NaT, and test_column_stats_dynamic_schema_missing_data covers slices where the column is wholly absent (a different mechanism — the column isn't aggregated at all, so the expected stats are None rather than non-zero counts). Please add a test that produces a single segment with a sparse column containing gaps (e.g. via an Arrow input with a validity bitmap, or whatever existing fixture produces column->is_sparse() == true with last_row() + 1 > row_count()) and asserts the resulting v1_NULL_COUNT matches the number of gaps.

Also, please double-check the formula: last_row() + 1 - row_count() assumes last_row() (i.e. last_logical_row_) reflects the full logical length of the slice. If the trailing rows of a segment are null, last_logical_row_ may be set to the last present row rather than the segment's logical length, which would under-count trailing nulls. The test should cover that case explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant